Abstract

In  this  memorandum  we  present  an  informally  argued 
derivation  of  the  properties  of  topographic  vector 
quantisers  in  the  limit  of  a  large  codebook  size.  In 
particular,  we  prove  that  the  code  vector  density  does 
not  depend  on  one's  choice  of  neighbourhood 
function,  provided  that  we  use  the  minimum  distortion 
(rather  than  the  nearest  neighbour)  encoding 
prescription.  This  result  suggests  that  widespread  use 
of  the  nearest  neighbour  prescription  in  topographic 
mapping  networks  is  fundamentally  misguided.  It 
would  be  advisable  to  remember  that  the  nearest 
neighbour prescription is assumed, not derived, so its
adherents  must  accept  defeat  gracefully. 


© Crown Copyright 1992




Stephen  P  Luttrell,  6  August  1992 


Table of Contents

1. Introduction
2. Outline
3. Covariance matrices
4. Approximations
4.1. Nearest neighbour encoding
4.2. Covariance matrix
4.3. Lattice of code vectors
5. Minimise the distortion
5.1. Code vector density
5.2. Stationary distortion
5.3. Stationary code vector density
5.4. Special case: vector quantiser
6. Conclusions
7. Notation
8. References




Code  Vector  Density 




1.  Introduction 

In this memorandum we concern ourselves with some interesting theoretical properties of
unsupervised  adaptive  networks  (specifically,  topographic  mappings  [1]).  This  is  part  of  the 
outcome  of  a  programme  of  research  whose  purpose  is  to  develop  a  more  rigorous 
theoretical  underpinning  for  topographic  mappings  than  has  hitherto  appeared  in  the 
literature. 

The  asymptotic  properties  of  topographic  mappings  have  been  the  focus  of  some  recent 
research  [2,3,4].  The  derivations  in  [4]  concentrated  on  standard  topographic  mappings  [1], 
whereas  [3]  concentrated  on  a  slight  variation  of  this  approach  which  used  a  vector 
quantiser  model  [5,6]  to  develop  topographic-like  mappings.  For  convenience,  we  call  such 
a  model  a  topographic  vector  quantiser,  to  emphasise  its  relationship  to  both  topographic 
mappings  and  to  vector  quantisers.  Both  of  these  studies  were  limited  to  the  mapping  of  a 
one  dimensional  input  space  to  a  one  dimensional  output  space. 

In [4] an asymptotic density of weights (i.e. one dimensional weight vectors) $\rho \propto P^{\alpha}$ with
$\alpha = \frac{(2w+1)^2/3}{(w+1)^2 + w^2}$ emerged, where $w$ described the half-width of a symmetric
topographic neighbourhood function. In [3] the much simpler result $\alpha = 1/3$ emerged,
assuming a symmetric, monotonically decreasing topographic neighbourhood function. This
$\alpha = 1/3$ result is the same as is obtained in a standard scalar quantiser [7].
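To see how strongly the exponent quoted from [4] depends on the neighbourhood half-width, the following sketch evaluates it exactly for a few values of w. It assumes the expression reads alpha = ((2w+1)^2/3)/((w+1)^2+w^2); the function name `alpha` is ours and is purely illustrative.

```python
from fractions import Fraction


def alpha(w: int) -> Fraction:
    """Exponent alpha(w) = ((2w+1)^2 / 3) / ((w+1)^2 + w^2), as read from [4]."""
    return Fraction((2 * w + 1) ** 2, 3) / ((w + 1) ** 2 + w ** 2)


# Zero half-width recovers the scalar quantiser value 1/3;
# the exponent then grows with w and stays below 2/3.
for w in (0, 1, 2, 5, 50):
    print(w, alpha(w), float(alpha(w)))
```

Under this reading the exponent equals 1/3 only at w = 0 and otherwise changes with w, which is the neighbourhood dependence that the a = 1/3 result of [3] avoids.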

In  this  note  we  derive  a  vector  generalisation  of  the  scalar  results  that  we  presented  in  [3], 
which  can  be  applied  to  the  problem  of  mapping  an  N  dimensional  input  space  to  an  n 
dimensional output space ($n < N$). The result $\rho \propto P^{N/(N+2)}$ (which reduces to $P^{1/3}$ in the one
dimensional $N = 1$ case, as expected) is already known to hold for a standard vector quantiser
[8],  and  the  critical  question  is  whether  this  also  applies  to  our  variant  of  topographic 
mappings,  as  it  did  in  the  one  dimensional  case.  It  turns  out  that  it  does,  but  the  derivation  is 
rather  involved. 

2.  Outline 

Our  derivation  rests  on  three  critical  steps. 

1.  We  assume  that  we  can  ignore  second  order  effects  such  as  curvature  of  the  surface  in 
which  the  code  vectors  sit.  This  assumption  may  be  satisfied  by  choosing  a 
topographic  neighbourhood  function  whose  mass  is  concentrated  in  the 
neighbourhood  of  a  single  value,  and  then  ensuring  that  the  number  of  code  vectors  is 
sufficiently  large  that  the  surface  formed  by  the  code  vectors  is  locally  smooth.  This 
allows  us  to  approximate  the  expression  for  the  average  Euclidean  distortion  as  the 
sum  of  a  pair  of  covariance  matrices  (or  inertia  tensors):  the  covariance  of  each 
quantisation  cell,  plus  the  covariance  of  the  cells  in  its  topographic  neighbourhood. 

2.  We  assume  that  the  topographic  neighbourhood  function  is  translation  invariant,  to 
show  that  the  relationship  between  these  two  types  of  covariance  matrix  does  not 
depend  on  one’s  location,  apart  from  trivial  rotation  factors.  Thus  we  express  the 
Euclidean  distortion  solely  in  terms  of  the  covariance  of  each  quantisation  cell. 

3. We define a code vector density in terms of this covariance. This allows us to hold
constant  the  total  number  of  code  vectors  whilst  we  minimise  the  average  Euclidean 
distortion  with  respect  to  the  covariance  of  each  quantisation  cell. 

Throughout  our  calculations  we  make  extensive  use  of  probabilities.  We  do  this  in  order  to 
simplify the notational problems that can arise when we manipulate quantities, such as "the set of all
inputs  that  encode  to  give  a  particular  code".  Also,  we  make  extensive  use  of  Bayes’ 
theorem  to  manipulate  the  probabilities  into  various  forms,  as  required.  We  do  not  claim 
that  the  use  of  probabilities  (plus  Bayes'  theorem)  is  a  necessary  part  of  our  calculations, 
but  it  certainly  makes  them  much  easier  to  perform,  because  the  notation  does  most  of  the 
work  for  us. 

3.  Covariance  matrices 

Firstly,  define  the  average  Euclidean  distortion  (or  Lyapunov  function)  for  encoding  via 
$y(x)$, adding noise via $P(y'|y)$, and then decoding via $x'(y')$ as

$$D = \int dx\, P(x) \int dy'\, P(y'|y(x))\, \|x - x'(y')\|^2 \qquad (1)$$

The integration over $y'$ takes account of the various possible distortions that might be
applied to the code $y$, and the integration over $x$ averages over the various possible input
vectors that might be presented. Although this definition of $D$ uses an $L_2$ distortion metric,
our results can easily be generalised to an $L_r$ distortion metric, as we shall show later on.

Note  that  at  this  stage  the  encoding  and  decoding  functions  y(x)  and  x'(y')  have  not  yet 
been  specified;  it  is  minimisation  of  D  with  respect  to  the  choice  of  these  functions  that 
determines  their  actual  form.  We  shall  deal  with  x'(y')  immediately,  and  defer  y(x)  until  later 
on. 

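Define the average vectors

$$x(y) \equiv \int dx\, P(x|y)\, x$$
$$x'(y') \equiv \int dx\, P(x|y')\, x = \int dy\, P(y|y')\, x(y) \qquad (2)$$
$$x_0(y) \equiv \int dy'\, P(y'|y)\, x'(y')$$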

where we have implicitly used Bayes' theorem to construct the posterior probabilities
$P(x|y)$, $P(x|y')$ and $P(y|y')$. Note that in Equation 2 we define $x'(y')$ so that it satisfies
$\partial D/\partial x'(y') = 0$, so henceforth we do not need to worry about minimisation of $D$ with respect
to $x'(y')$. We make the other two definitions for later use, and present them at this stage
merely for convenience.

We  wish  to  manipulate  D  into  a  form  in  which  it  is  expressed  as  an  integral  over  y-space,  so 
that  we  can  identify  how  much  distortion  is  associated  with  each  code  vector.  We  therefore 
have  to  invert  the  order  of  the  integrations  in  Equation  1.  In  order  to  do  this  we  must 
express  the  encoding  operation  y(x)  using  probability  notation.  Thus 

$$D \equiv \int dx\, P(x) \int dy'\, P(y'|y) \int dy\, P(y|x)\, \|x - x'(y')\|^2 \qquad (3)$$

where $P(y|x) \equiv \delta(y - y(x))$. Now we manipulate the probabilities using Bayes' theorem and the
fact that $x \to y \to y'$ is a Markov chain, to obtain

$$P(x)\, P(y'|y)\, P(y|x) = P(x, y, y') = P(y)\, P(y'|y)\, P(x|y) \qquad (4)$$

whence

$$D = \int dy\, P(y) \int dy'\, P(y'|y) \int dx\, P(x|y)\, \|x - x'(y')\|^2 = \mathrm{trace}\left( \int dy\, P(y)\, \sigma(y) \right) \qquad (5)$$

where $\mathrm{trace}\, z z^T = \|z\|^2$, and where we have defined the covariance matrix $\sigma(y)$ as

$$\sigma(y) \equiv \int dy'\, P(y'|y) \int dx\, P(x|y)\, (x - x'(y'))(x - x'(y'))^T \qquad (6)$$

We  have  successfully  arranged  D  as  an  integral  over  y-space,  and  we  have  "opened  up"  the 
norm  to  reveal  the  covariance  matrix  that  is  hidden  within.  Our  use  of  a  covariance  matrix 
appears  to  be  an  unnecessary  step  at  this  point,  but  it  turns  out  to  be  essential  in  order  to 
define  a  code  vector  density  later  on.  We  therefore  use  (traces  of)  covariance  matrices, 
rather  than  norms,  throughout  our  calculations. 

We may split $\sigma(y)$ into a sum of more primitive pieces as follows.

1. Use the identity $(x - x'(y')) \equiv (x - x(y)) - (x'(y') - x(y))$ to obtain a pair of terms
$(x - x(y))(x - x(y))^T$ and $(x'(y') - x(y))(x'(y') - x(y))^T$. Note that the cross term vanishes
when $x$ is integrated.

2. Use the identity $(x'(y') - x(y)) \equiv (x'(y') - x_0(y)) - (x(y) - x_0(y))$ to replace the second of these
terms by another pair of terms $(x'(y') - x_0(y))(x'(y') - x_0(y))^T$ and $(x(y) - x_0(y))(x(y) - x_0(y))^T$.
Note that the cross term vanishes when $y'$ is integrated.

3. Define some covariance matrices

$$\sigma_0(y) \equiv \int dx\, P(x|y)\, (x - x(y))(x - x(y))^T$$
$$\sigma_1(y) \equiv \int dy'\, P(y'|y)\, (x'(y') - x_0(y))(x'(y') - x_0(y))^T \qquad (7)$$
$$\sigma_2(y) \equiv (x(y) - x_0(y))(x(y) - x_0(y))^T$$

to yield finally

$$\sigma(y) = \sigma_0(y) + \sigma_1(y) + \sigma_2(y) \qquad (8)$$

We may interpret $\sigma_0(y)$, $\sigma_1(y)$ and $\sigma_2(y)$ as follows.

Consider $\sigma_0(y)$ first of all. $x(y)$ is the centroid of $x$ (given $y$), which we obtain by integrating
(using as a "measure" the posterior probability $P(x|y)$) over the continuum of inputs $x$. The
vector $x - x(y)$ is the displacement of $x$ relative to this centroid, so $\sigma_0(y)$ is the covariance of
$x$ (given $y$).

Now consider $\sigma_1(y)$. In this case we integrate over $y'$, not over $x$ as we did for $\sigma_0(y)$.
Because $y'$ actually corresponds to a discrete index in the codebook, this integral is a sum
(using as a "measure" the probability $P(y'|y)$) over code indices. In this case $x_0(y)$ is the
centroid of $x'(y')$ (given $y$), and the vector $x'(y') - x_0(y)$ is the displacement of $x'(y')$ relative
to this centroid, so $\sigma_1(y)$ is the covariance of $x'(y')$ (given $y$).

Finally consider $\sigma_2(y)$. $x(y)$ and $x_0(y)$ are the centroids used to define $\sigma_0(y)$ and $\sigma_1(y)$,
respectively. $\sigma_2(y)$ is the dyadic constructed from the displacement of these centroids from
each other.

We  should  point  out  at  this  stage  that  our  calculations  are  exact  thus  far.  In  order  to  make 
any  further  progress  we  need  to  make  approximations,  which  we  shall  discuss  in  detail. 
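Because the decomposition is exact, it can be checked numerically on a small discrete example. The sketch below is only an illustration under our own assumptions: a 2-dimensional Gaussian sample stands in for P(x), the encoder is an arbitrary nearest-centre rule, and P(y'|y) is a Gaussian in the code index. All variable names (`x_bar`, `x_dec`, `x_0`, `scatter`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, n_samples = 2, 6, 3000

# Samples standing in for P(x); the first K samples double as encoder centres,
# which guarantees that every code y owns at least one sample.
x = rng.normal(size=(n_samples, N))
centres = x[:K]
y_of_x = ((x[:, None, :] - centres[None]) ** 2).sum(-1).argmin(axis=1)

# Neighbourhood function P(y'|y): row y, column y' (Gaussian in the code index).
idx = np.arange(K)
P_yp_y = np.exp(-0.5 * (idx[None, :] - idx[:, None]) ** 2)
P_yp_y /= P_yp_y.sum(axis=1, keepdims=True)

# Centroids x(y), decoder x'(y') via Bayes' theorem, and x_0(y).
P_y = np.bincount(y_of_x, minlength=K) / n_samples
x_bar = np.stack([x[y_of_x == k].mean(axis=0) for k in range(K)])
P_joint = P_y[:, None] * P_yp_y                        # P(y, y')
P_y_yp = P_joint / P_joint.sum(axis=0, keepdims=True)  # P(y|y')
x_dec = P_y_yp.T @ x_bar
x_0 = P_yp_y @ x_dec


def scatter(points, centre, weights=None):
    """Weighted sum of outer products (p - centre)(p - centre)^T."""
    d = points - centre
    if weights is None:
        weights = np.full(len(points), 1.0 / len(points))
    return (weights[:, None] * d).T @ d


max_err, D_split = 0.0, 0.0
for k in range(K):
    cell = x[y_of_x == k]
    # Full covariance of Equation 6, evaluated directly.
    sigma = sum(P_yp_y[k, j] * scatter(cell, x_dec[j]) for j in range(K))
    s0 = scatter(cell, x_bar[k])
    s1 = scatter(x_dec, x_0[k], P_yp_y[k])
    s2 = np.outer(x_bar[k] - x_0[k], x_bar[k] - x_0[k])
    max_err = max(max_err, np.abs(sigma - (s0 + s1 + s2)).max())
    D_split += P_y[k] * np.trace(s0 + s1 + s2)

# Distortion of Equation 1 evaluated directly on the samples.
sq = ((x[:, None, :] - x_dec[None]) ** 2).sum(-1)
D_direct = (P_yp_y[y_of_x] * sq).sum(axis=1).mean()
print(max_err, D_direct, D_split)
```

The two cross terms vanish because x(y) and x_0(y) are the means of the distributions they are paired with, so the agreement holds to rounding error for any encoder, not only an optimal one.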

4.  Approximations 

In  this  Section  we  shall  introduce  several  approximations  that  are  essential  to  the  analysis  of 
the  leading  order  properties  of  topographic  vector  quantisers.  The  overall  goal  of  this 
Section is to develop an approximate expression for the covariance matrix $\sigma(y)$ in the form
$\sigma(y) = M(y)\, \sigma_0(y)\, M^T(y)$, where $M(y)$ is a transformation, in order to express the distortion $D$
in terms of $\sigma_0(y)$ alone, rather than $\sigma_0(y) + \sigma_1(y)$.

In Section 4.1 we introduce a very useful approximation for the encoding function $y(x)$,
which allows us to replace the minimum distortion encoding prescription by an approximately
equivalent nearest neighbour encoding prescription. Note that this is not simply a naive
replacement of minimum distortion by nearest neighbour. In Section 4.2 we use this
approximation for $y(x)$ to write an approximate expression for the covariance matrix $\sigma(y)$,
which is amenable to further analysis, unlike the exact expression. In Section 4.3 we
introduce a local lattice, and translation invariance $P(y'|y) = P(y' - y)$, in order to approximate
the positions of the $x'(y')$ in each neighbourhood, which allows us to interrelate the two
components of $\sigma(y)$.

4.1.  Nearest  neighbour  encoding 

The $y(x)$ that minimises the Euclidean distortion $D$ is given by

$$y(x) = \arg\min_y \int dy'\, P(y'|y)\, \|x - x'(y')\|^2 \qquad (9)$$

where "$\arg\min_y \ldots$" means "the value of $y$ that minimises ...". Because the expression that
is minimised is $\int dy'\, P(y'|y)\, \|x - x'(y')\|^2$, rather than $\|x - x'(y)\|^2$, this type of $y(x)$ is called
a "minimum distortion" encoding prescription, rather than a "nearest neighbour" encoding
prescription. Note that these two prescriptions are the same when the neighbourhood
function collapses to zero size, i.e. $P(y'|y) = \delta(y' - y)$.

By using the identity $(x - x'(y')) \equiv (x - x_0(y)) - (x'(y') - x_0(y))$, and noting that the cross term
vanishes when $y'$ is integrated, we may rearrange $y(x)$ into the form

$$y(x) = \arg\min_y \left( \|x - x_0(y)\|^2 + \int dy'\, P(y'|y)\, \|x'(y') - x_0(y)\|^2 \right) = \arg\min_y \left( \|x - x_0(y)\|^2 + \mathrm{trace}\, \sigma_1(y) \right) \qquad (10)$$

Let us assume that the $\mathrm{trace}\, \sigma_1(y)$ term is a slowly varying function of $y$ compared to the
$\|x - x_0(y)\|^2$ term, then we may approximate $y(x)$ as

$$y(x) \approx \arg\min_y \|x - x_0(y)\|^2 \qquad (11)$$

which  has  the  form  of  a  "nearest  neighbour"  encoding  prescription.  Therefore  the  x0(y)  act 
as  the  effective  code  vectors  in  a  "nearest  neighbour"  encoding  prescription  that  is 
approximately  equivalent  to  the  ideal  "minimum  distortion"  encoding  scheme. 

We have deduced that the effect of the neighbourhood function $P(y'|y)$ can be accounted for
by replacing "minimum distortion" encoding by an approximately equivalent "nearest
neighbour" encoding prescription, which is not simply obtained by ignoring the effect of
$P(y'|y)$ altogether, as one might have naively assumed.

We  shall  make  extensive  use  of  this  equivalence,  because  "nearest  neighbour"  encoding  is 
conceptually  simpler  than  "minimum  distortion"  encoding. 
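The rearrangement of the minimum distortion cost into a squared distance to x_0(y) plus trace sigma_1(y) is an exact identity, and only dropping the trace term is an approximation. The sketch below checks the identity and compares the two encoders on random inputs. The decoder vectors, the Gaussian neighbourhood and all names (`x_dec`, `cost_md`, `y_nn`) are illustrative assumptions, not quantities taken from this memorandum.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 2, 8

# Hypothetical decoder vectors x'(y') laid out along a random walk in x-space.
x_dec = np.cumsum(rng.normal(size=(K, N)), axis=0)

# Neighbourhood P[y, y'] = P(y'|y), Gaussian in the code index.
idx = np.arange(K)
P = np.exp(-0.5 * (idx[None, :] - idx[:, None]) ** 2)
P /= P.sum(axis=1, keepdims=True)

# Effective code vectors x_0(y) and trace of sigma_1(y).
x_0 = P @ x_dec
tr_s1 = (P * ((x_dec[None, :, :] - x_0[:, None, :]) ** 2).sum(-1)).sum(axis=1)

# Minimum distortion cost for each input and each candidate code y.
x = 2.0 * rng.normal(size=(500, N))
sq = ((x[:, None, :] - x_dec[None]) ** 2).sum(-1)   # |x - x'(y')|^2
cost_md = sq @ P.T                                   # sum over y' of P(y'|y) |x - x'(y')|^2

# The same cost written as nearest-neighbour distance plus trace sigma_1(y).
dist_nn = ((x[:, None, :] - x_0[None]) ** 2).sum(-1)
cost_split = dist_nn + tr_s1[None, :]

y_md = cost_md.argmin(axis=1)     # exact minimum distortion encoder
y_nn = dist_nn.argmin(axis=1)     # approximate nearest neighbour encoder
agree = (y_md == y_nn).mean()
print(np.abs(cost_md - cost_split).max(), agree)
```

The fraction `agree` measures how often the two encoders coincide for this particular codebook; it depends on how slowly trace sigma_1(y) varies with y, which is exactly the assumption made above.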

4.2.  Covariance  matrix 

The contribution of the $\sigma_2(y)$ term to the Euclidean distortion $D$ is proportional to
$\|x(y) - x_0(y)\|^2$. Therefore minimising $D$ tends to make the vectors $x(y)$ and $x_0(y)$ become
similar to each other, so we shall henceforth assume

$$x(y) \approx x_0(y) \qquad (12)$$

This approximation says that the centroid $x(y)$ of the quantisation cell attached to $y$ is
approximately coincident with the effective code vector $x_0(y)$ that defines the quantisation
cell boundaries via a "nearest neighbour" prescription.

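We may use this relationship to approximate the contributions to $\sigma(y)$ as

$$\sigma_0(y) \approx \int dx\, P(x|y)\, (x - x_0(y))(x - x_0(y))^T$$
$$\sigma_1(y) = \int dy'\, P(y'|y)\, (x'(y') - x_0(y))(x'(y') - x_0(y))^T \qquad (13)$$
$$\sigma_2(y) \approx 0$$

where we repeat the exact equation for $\sigma_1(y)$, for convenience.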

We use the $x_0(y)$ to implement a "nearest neighbour" encoding scheme, so $\sigma(y)$ reduces
approximately to a sum $\sigma_0(y) + \sigma_1(y)$ of two contributions which we may readily interpret:

(a) $\sigma_0(y)$ is the covariance matrix of the "quantisation cell" attached to $y$, and centred
on $x_0(y)$.

(b) $\sigma_1(y)$ is the covariance matrix of the vectors $x'(y')$ having mass $P(y'|y)$, whose
centroid is $x_0(y)$.

This simplification of the interpretation of $\sigma(y)$ is possible only because of the assumptions
that we have made, which are essentially all to do with assuming that properties vary slowly
on the scale of a quantisation cell.

4.3. Lattice of code vectors

There are two approximations that we make in order to simplify our analysis of the code
vectors $x'(y')$.

1. We make our first approximation in order to relate $\sigma_1(y)$ to $\sigma_0(y)$ in a simple way.
Thus we develop a first order Taylor expansion of $x'(y')$ about the point $y' = y'_0$

$$x'(y') \approx x'(y'_0) + (y' - y'_0) \cdot \frac{\partial x'(y'_0)}{\partial y'_0} \qquad (14)$$

The term proportional to $(y' - y'_0)$ generates a linear variation of $x'(y')$ with $y'$, which
corresponds to a uniform lattice of vectors $x'(y')$, with $y'$ indexing the lattice points
and $x'(y')$ locating the lattice points in $x$-space.

The omitted higher order terms would introduce non-uniformities into this lattice.
There are two basic types of contribution: non-uniform stretching of the lattice, and
lattice curvature. These two contributions arise from components of the second
derivative $\partial^2 x'(y'_0)/\partial y'_0\, \partial y'_0$ which lie parallel or perpendicular to the surface in which
$\partial x'(y'_0)/\partial y'_0$ lies, respectively. We shall assume that the properties of $x'(y')$ vary
sufficiently slowly on the scale of a topographic neighbourhood that we may ignore
these higher order corrections.

2. Now we make our second approximation in order to ensure that the relationship
between $\sigma_1(y)$ and $\sigma_0(y)$ has a simple dependence on $y$. We assume that $P(y'|y)$ is
translation invariant, so that

$$P(y'|y) = P(y' - y) \qquad (15)$$

Equation 2 then becomes

$$x_0(y) = \int dy'\, P(y' - y)\, x'(y') \qquad (16)$$

which ensures that the lattice of $x_0(y)$ is locally uniform, because it is a convolution of a
fixed kernel with a locally uniform lattice of $x'(y')$. This requires that the $x_0(y)$ and the $x'(y')$
lattices are locally identical, modulo a spatial translation. Furthermore, this uniformity
implies that the quantisation cells (implied by the $x_0(y)$) are themselves locally identical in
shape.
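The claim that convolving a uniform lattice with a fixed kernel returns the same lattice, up to a translation, can be illustrated with a one dimensional code index mapped into a two dimensional input space. The lattice origin `a`, step `b` and the asymmetric `kernel` below are arbitrary values chosen for this sketch.

```python
import numpy as np

a = np.array([0.3, -1.0])      # lattice origin in x-space (N = 2)
b = np.array([0.5, 0.2])       # displacement in x-space per unit step of y' (n = 1)

offsets = np.arange(-3, 4)     # values of y' - y covered by the kernel
kernel = np.array([1.0, 2.0, 4.0, 6.0, 3.0, 1.0, 1.0])
kernel /= kernel.sum()         # P(y' - y), deliberately asymmetric


def x_dec(yp):
    """Uniform lattice x'(y') = a + y' * b for an array of indices y'."""
    return a + np.outer(yp, b)


# x_0(y) is the sum over y' of P(y' - y) x'(y'), evaluated for several y.
ys = np.arange(10)
x_0 = np.stack([(kernel[:, None] * x_dec(y + offsets)).sum(axis=0) for y in ys])

spacing = np.diff(x_0, axis=0)   # lattice step of the x_0(y)
shift = x_0 - x_dec(ys)          # translation relative to the x'(y') lattice
print(spacing[0], shift[0])
```

The spacing of the x_0(y) equals `b` everywhere, and the translation is `b` times the mean offset of the kernel, which is zero for a symmetric neighbourhood function.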

Under these conditions, the decomposition $\sigma(y)$ can be expressed in the form

$$\sigma(y) \approx M(y)\, \sigma_0(y)\, M(y)^T \qquad (17)$$

where $M(y)$ is a matrix that transforms $\sigma_0(y)$ into $\sigma_0(y) + \sigma_1(y)$. Assuming $P(y'|y) = P(y' - y)$,
$\sigma_1(y)$ is determined uniquely by $\sigma_0(y)$, so $M(y)$ must have the approximate form

$$M(y) \approx R(y)\, S\, R^T(y)$$
$$M(y)^{-1} \approx R(y)\, S^{-1} R^T(y) \qquad (18)$$

where $R(y)$ is an orthogonal matrix (i.e. a rotation) which satisfies $R(y) R^T(y) = 1$, and where
$S$ is a diagonal matrix (i.e. a scaling). Note that $R(y)$ depends on $y$ because it must rotate
$\sigma_0(y)$ into a standard orientation (i.e. remove the peculiar local orientation of the lattice of
$x_0(y)$ vectors) before applying the invariant scaling $S$ (i.e. transform $\sigma_0(y)$ into $\sigma_0(y) + \sigma_1(y)$).

This factorisation of $\sigma(y)$ is important, because it allows us to simplify $\det(\sigma(y))$ thus

$$\det \sigma(y) = (\det S)^2 \det \sigma_0(y) \qquad (19)$$

This result simply states that the "volume" of the full covariance $\sigma(y)$ is a constant factor
times the "volume" of the covariance $\sigma_0(y)$ of a single quantisation cell. This result emerges
in leading order because we assumed $P(y'|y) = P(y' - y)$.
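The determinant relation follows from det(AB) = det(A) det(B) together with the fact that an orthogonal matrix has determinant of magnitude one. A short numerical check, using an arbitrary orthogonal matrix, diagonal scaling and positive definite cell covariance that we invent for the purpose, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3

# Orthogonal matrix from a QR factorisation (it may include a reflection,
# which does not affect the result because its determinant enters squared).
R, _ = np.linalg.qr(rng.normal(size=(N, N)))
S = np.diag([1.5, 2.0, 3.0])            # position independent scaling

A = rng.normal(size=(N, N))
sigma_0 = A @ A.T + np.eye(N)           # symmetric positive definite cell covariance

M = R @ S @ R.T
sigma = M @ sigma_0 @ M.T               # factorised full covariance

lhs = np.linalg.det(sigma)
rhs = np.linalg.det(S) ** 2 * np.linalg.det(sigma_0)
print(lhs, rhs)
```

Because S does not depend on position, the ratio of the two determinants is the same constant for every quantisation cell, whatever the local orientation R(y).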

5. Minimise the distortion

We now minimise the Euclidean distortion $D$ with respect to variations of the components
of $\sigma(y)$, whilst holding constant the total number $K$ of code vectors. In Section 5.1 we use
the covariance matrix $\sigma_0(y)$ to define a code vector density, which is, after all, the quantity
of interest. In Section 5.2 we minimise $D$ with respect to the values of the components of
$\sigma_0(y)$ whilst holding constant the total number of code vectors, and in Section 5.3 we
present the result for the optimal code vector density. In Section 5.4 we demonstrate how
our approach specialises to the case of minimising the Euclidean distortion in a standard
vector quantiser.

5.1.  Code  vector  density 

The code vector density $\rho(x)$ is defined as the number of quantisation cells per unit volume
of $x$-space, which is the reciprocal of the volume of the quantisation cells in the locality of
$x$. This volume may easily be determined from the covariance matrix $\sigma_0(y)$, to yield the
following definition of code vector density

$$\rho(x) \equiv (\det \sigma_0(y(x)))^{-1/2} \qquad (20)$$

The total number of code vectors $K$ is the integral over all $x$ of the code vector density, so

$$K = \int dx\, \rho(x) = \int dx\, (\det \sigma_0(y(x)))^{-1/2} \approx \det S \int dx\, (\det \sigma(y(x)))^{-1/2} \qquad (21)$$

where we used the factorisation of $\sigma(y)$ given in Equation 17. This expression for $K$ is
non-trivial, because it required a lot of work in Section 4 to derive Equation 17. Indeed, this was
the main purpose of Section 4.

5.2.  Stationary  distortion 

Finally, we have all of the basic results that we need in order to minimise the Euclidean
distortion $D$ with respect to the values of the components of $\sigma(y)$, whilst holding constant
the total number $K$ of code vectors.

We may impose the constraint by introducing a Lagrange multiplier $\lambda$ and locating the
stationary point of

$$D(\lambda) \equiv D + \lambda K \approx \int dx \left( P(x)\, \mathrm{trace}\, \sigma(y(x)) + \lambda \det S\, (\det \sigma(y(x)))^{-1/2} \right) \qquad (22)$$

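Functionally differentiate $D(\lambda)$ with respect to the components $\sigma_{ij}(y)$ of the covariance
matrix $\sigma(y)$, using the results

$$\frac{\delta\, \mathrm{trace}\, \sigma(w)}{\delta \sigma_{ij}(y)} = \delta_{ij}\, \delta(w - y) \qquad (23)$$

$$\frac{\delta \det \sigma(w)}{\delta \sigma_{ij}(y)} = \sigma(w)^{-1}_{ij}\, \det \sigma(w)\, \delta(w - y) \qquad (24)$$

where $\sigma(w)^{-1}_{ij}$ is the $(i,j)$-th component of $\sigma(w)^{-1}$. This yields

$$\frac{\delta D(\lambda)}{\delta \sigma_{ij}(y)} = \int dx\, \delta(y - y(x)) \left( P(x)\, \delta_{ij} - \frac{\lambda \det S}{2}\, \sigma(y(x))^{-1}_{ij}\, (\det \sigma(y(x)))^{-1/2} \right) \qquad (25)$$

If we assume that $P(x)$ varies slowly as $x$ ranges over those values that satisfy $y = y(x)$, then
we can satisfy the condition $\delta D(\lambda)/\delta \sigma_{ij}(y) = 0$ by choosing

$$\sigma_{ij}(y(x)) \propto P(x)^{-2/(N+2)}\, \delta_{ij} \qquad (26)$$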

It is important to note that the $x$-dependence of this result does not depend on the details of
the topographic neighbourhood function $P(y'|y)$, other than its assumed translation
invariance $P(y'|y) = P(y' - y)$, and the slow variation assumptions that we made earlier.

Note that the result in Equation 26 can easily be generalised to an $L_r$ distortion metric, by
replacing the $\mathrm{trace}\, \sigma(y)$ by $(\mathrm{trace}\, \sigma(y))^{r/2}$, to obtain eventually $\sigma_{ij}(y(x)) \propto P(x)^{-2/(N+r)}\, \delta_{ij}$.
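The step from the stationarity condition of Equation 25 to the power law of Equation 26 can be written out explicitly. The lines below are a sketch under the slow variation assumption, so that the integrand itself must vanish; the scalar $s(x)$ is a symbol that we introduce only for this illustration.

```latex
\begin{align*}
P(x)\,\delta_{ij}
  &= \frac{\lambda \det S}{2}\,\sigma(y(x))^{-1}_{ij}\,
     \bigl(\det \sigma(y(x))\bigr)^{-1/2}
  && \text{(integrand of Equation 25 set to zero)} \\
\sigma_{ij}(y(x)) &= s(x)\,\delta_{ij}
  && \text{(since } \sigma^{-1} \text{ must be proportional to } \delta_{ij}\text{)} \\
P(x) &= \frac{\lambda \det S}{2}\, s(x)^{-1}\, s(x)^{-N/2}
      = \frac{\lambda \det S}{2}\, s(x)^{-(N+2)/2} \\
s(x) &\propto P(x)^{-2/(N+2)}
\end{align*}
```

For an $L_r$ metric the left-hand side acquires an extra factor proportional to $s(x)^{r/2-1}$, which turns the last two lines into $s(x)^{-(N+r)/2} \propto P(x)$ and hence $s(x) \propto P(x)^{-2/(N+r)}$.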

5.3.  Stationary  code  vector  density 

Finally, we may use the factorisation of $\sigma(y)$ in Equation 17 to write the covariance of each
quantisation cell $\sigma_0(x)$ as

$$\sigma_0(x) \propto P(x)^{-2/(N+2)}\, R(y(x))\, S^{-2} R(y(x))^T \qquad (27)$$

and, using the definition of code vector density in Equation 20, $\rho(x)$ as

$$\rho(x) \propto P(x)^{N/(N+2)} \qquad (28)$$

which easily generalises to $\rho(x) \propto P(x)^{N/(N+r)}$ for an $L_r$ distortion metric.

The  result  in  Equation  28  reveals  that  the  code  vector  density  does  not  depend  on  the  form 
of  the  topographic  neighbourhood  function.  However,  note  that  we  made  the  following 
assumptions  in  order  to  derive  this  result: 

1. $P(y'|y) = P(y' - y)$, so that local translation invariant solutions can develop.

2. $P(y' - y)$ is "local", so that smooth solutions can develop.

3.  K  is  "large",  so  that  local  variations  of  P(x)  and  higher  order  (curvature)  corrections 
to  the  lattice  of  code  vectors  can  be  ignored. 

5.4.  Special  case:  vector  quantiser 

In  the  case  of  a  standard  vector  quantiser  (with  a  zero  width  topographic  neighbourhood 
function),  a  simplified  version  of  our  approach  can  be  used.  The  distortion  D  is  then  given 
by 

$$D \equiv \int dx\, P(x)\, \|x - x(y(x))\|^2 = \int dx\, P(x)\, \mathrm{trace}\, \sigma(y(x)) \qquad (29)$$

where $P(y'|y) = \delta(y' - y)$ and $\sigma(y) = \sigma_0(y)$ (i.e. we do not need to consider extra terms arising
from a non-zero topographic neighbourhood). The code vector density $\rho(x)$ is then given by

$$\rho(x) \equiv (\det \sigma(y(x)))^{-1/2} \qquad (30)$$

We could minimise the distortion with respect to the components $\sigma_{ij}(y)$ of the covariance
matrix $\sigma(y)$, as we have already done. However, it is simpler to argue straight away that
$\sigma(y)$ must be isotropic (i.e. $\sigma_{ij}(y) = s(y)\, \delta_{ij}$), because locally there are no special directions in
$x$-space. Thus we can write

$$\mathrm{trace}\, \sigma(y) = N\, s(y)$$
$$\det \sigma(y) = s(y)^N \qquad (31)$$

and then minimise the distortion with respect to the scalar $s(y)$, or, equivalently, with
respect to the code vector density $\rho(x)$ ($= s(y(x))^{-N/2}$) itself.

If we express this minimisation problem in terms of $\rho(x)$, rather than $s(y)$, it becomes

$$\frac{\delta}{\delta \rho(x)} \left( \int dx\, P(x)\, \rho(x)^{-2/N} + \lambda \int dx\, \rho(x) \right) = 0 \qquad (32)$$

whose solution is $\rho(x) \propto P(x)^{N/(N+2)}$, as expected.

This result can easily be generalised to an $L_r$ distortion metric by replacing the $\rho(x)^{-2/N}$ by
$\rho(x)^{-r/N}$, to obtain $\rho(x) \propto P(x)^{N/(N+r)}$. This derivation of the optimum code vector density is
simpler, and more intuitive, than the one presented in [8].
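The stationary solution can be checked numerically by restricting the code vector density to the family rho proportional to P^beta, normalising each member to the same number of code vectors K, and comparing the resulting distortions. The sketch below does this for N = 1 with a Gaussian input density on a finite grid; the grid, the value of K and the list of trial exponents are arbitrary choices made for illustration.

```python
import numpy as np

N, K = 1, 100.0

# Input density P(x): a unit Gaussian tabulated on a finite grid.
x = np.linspace(-4.0, 4.0, 2001)
dx = x[1] - x[0]
P = np.exp(-0.5 * x ** 2)
P /= P.sum() * dx


def distortion(beta):
    """Distortion integral of P(x) rho(x)^(-2/N) for rho proportional to P^beta."""
    rho = P ** beta
    rho *= K / (rho.sum() * dx)          # enforce the constraint: integral of rho = K
    return (P * rho ** (-2.0 / N)).sum() * dx


betas = [0.1, 0.2, N / (N + 2), 0.5, 0.7, 1.0]
D = [distortion(beta) for beta in betas]
best = betas[int(np.argmin(D))]
print(dict(zip(betas, D)), best)
```

Among the trial exponents the smallest distortion occurs at beta = N/(N+2), in agreement with the solution of Equation 32. Changing the exponent -2/N inside `distortion` to -r/N moves the minimum to N/(N+r).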

6.  Conclusions 

In  this  memorandum  we  have  examined  the  properties  of  a  special  type  of  vector  quantiser, 
called a topographic vector quantiser because of its similarity to topographic mappings.
The  only  difference  between  the  respective  training  algorithms  is  that  topographic  mappings 
use  a  "nearest  neighbour"  encoding  prescription,  whereas  topographic  vector  quantisers  use 
a  "minimum  distortion"  encoding  prescription. 


We  have  derived  the  leading  order  properties  of  the  code  vectors  of  topographic  vector 
quantisers,  and  we  have  shown  that  the  code  vector  density  is  insensitive  to  one's  choice  of 
topographic  neighbourhood  function,  provided  that  it  is  of  convolution  type  (i.e.  the  same 
for  all  code  vectors).  This  result  stands  in  stark  contrast  to  the  strong  dependence  of  the 
code  vector  density  on  topographic  neighbourhood  function  in  a  standard  topographic 
mapping. 

This  result  strongly  suggests  that  minimum  distortion  encoding  is  a  more  theoretically 
respectable  encoding  prescription  than  nearest  neighbour  encoding.  Indeed,  minimum 
distortion  is  a  derived  prescription,  whereas  nearest  neighbour  is  an  assumed  prescription, 
so  the  nearest  neighbour  prescription  should  be  used  with  suspicion. 

7.  Notation 

Define the basic vectors and functions.

$x$  input vector
$y$  code
$y'$  distorted code
$y(x)$  encoding function - transform from $x$-space to $y$-space
$x'(y')$  decoding function - transform from $y'$-space to $x$-space
$N$  the dimensionality of the input vector
$D$  the average Euclidean distortion between input and reconstruction
$\rho(x)$  the code vector density
$K$  the total number of code vectors
$\lambda$  a Lagrange multiplier

Define the basic densities.

$P(x)$  density of inputs
$P(y|x)$  density of codes (given that the input is known) - assumed to be $\delta(y - y(x))$
$P(y'|y)$  density of distorted codes (given that the code is known)

Define the derived densities (obtained using Bayes' theorem).

$P(x|y)$  posterior density of inputs (given that the code is known)
$P(x|y')$  posterior density of inputs (given that the distorted code is known)
$P(y|y')$  posterior density of codes (given that the distorted code is known)

Define the derived functions (obtained using the posterior densities, etc).

$x(y)$  the centroid of $x$ (given that $y$ is known)
$x_0(y)$  the centroid of $x'(y')$ (given that $y$ is known)
$\sigma(y)$  the matrix whose average trace is the Euclidean distortion
$\sigma_0(y)$  the contribution to $\sigma(y)$ from a quantisation cell
$\sigma_1(y)$  the contribution to $\sigma(y)$ from the topographic neighbourhood of a cell
$M(y)$  the transformation which converts $\sigma_0(y)$ into $\sigma(y)$
$R(y)$  the "rotate" part of $M(y)$
$S$  the "scale" part of $M(y)$